Abstract: The existence of a significant amount of correlation
in the network traffic has stimulated the development of in- 
network traffic reduction techniques since end-to-end universal 
compression solutions would not perform well over Internet
packets due to the finite-length nature of data. Recently, we 
proposed a memory-assisted universal compression technique 
that holds a significant promise for reducing the amount of 
traffic in the networks. The idea is based on the observation 
that if a finite-length sequence from a server (source) is to be 
compressed and transmitted over the network, the associated 
universal code entails a substantial overhead. On the other hand, 
intermediate nodes can reduce the transmission overhead by 
memorizing the source statistics when forwarding the sequences 
from the previous communications with the server. In this paper,
we extend this idea to the scenario where multiple servers 
are present in the network by proposing distributed network 
compression via memory. We consider two spatially separated 
sources with correlated unknown source parameters. We wish 
to study the universal compression of a sequence of length 
n from one of the sources provided that the decoder has 
access to (i.e., memorized) a sequence of length m from the 
other source. In this setup, the correlation does not arise from
symbol-by-symbol dependency of two outputs from the two
sources (as in the Slepian-Wolf setup). Instead, the two sequences
are correlated because they originate from two sources
with unknown correlated parameters. The finite-length nature
of the compression problem at hand requires considering a 
notion of almost lossless source coding, where coding incurs 
an error probability $p_e(n)$ that vanishes as the sequence length
n grows to infinity. We obtain bounds on the redundancy of 
almost lossless codes when the decoder has access to a random 
memory of length m as a function of the sequence length 
n and the permissible error probability $p_e(n)$. Our results
demonstrate that distributed network compression via memory 
has the potential to significantly improve over conventional 
end-to-end compression when sufficiently large memory from 
previous communications is available to the decoder. 

I. Introduction 

Several networking applications involve acquiring data 
from multiple distributed (i.e., spatially separated) sources 
that cannot communicate with each other. These applications
include acquiring digital/analog data from sensors m-Q, 
the CEO problem Q, Q, delivery of network packets in a 
content-centric network fSl, acquiring data from femtocell 
wireless networks [9], [10], acquiring data chunks from
the cloud [11], [12], etc. What is perhaps common in all
of the above problems is the bandwidth limitation, i.e.,
there is a fundamental capacity for the information that can
be transmitted in the network infrastructure. Hence, data
compression can significantly improve the performance in
any such application.

The premise of data compression broadly relies on the 
data being correlated. As one example, when data is gathered 
from multiple sensors that measure the same phenomenon 
(e.g., temperature), the readings from the sensors are clearly 
correlated. As another example, when chunks of the same 
file/content are acquired by a client in a content-centric 
network, the data chunks are correlated as they originate
from the same data server. Further, data that originates
from a mirror server is correlated with data that comes
from the original server. The focus of this work is on the
reduction of the wireless/wired data traffic from multiple
sources by utilizing such correlations. The scope of this work
is significant, as correlation levels as high as 90% have
been reported in wired/wireless Internet traffic data [13]-[15],
which has motivated a lot of research on reducing
the traffic by utilizing such correlations.

Existing solutions that utilize such correlations in order
to reduce the data transmission in the Internet are limited
in scope. Application-level content caching [16] cannot
utilize the packet-level redundancy and statistical correlations
across the contents. Packet-level redundancy elimination
techniques [17] are ad hoc in nature and can only
remove duplicates of a big chunk of the data packet, while
they ignore the statistical correlations at the packet level.
Application-level universal compression techniques [18]-[20]
do not utilize packet-level redundancies and, more importantly,
cannot utilize the correlations in data that originate from
spatially separated sources. Packet-level memory-assisted
compression techniques [14], [21], [22] utilize the statistical
correlation among the packets, but their extension to multiple
sources is not readily available.

In this paper, we introduce and study distributed network 
compression via memory, where we assume that the un- 
known parameter vectors of the distributed sources follow 
a correlated statistical model. By distributed we mean that 
the sources are spatially separated and the encoders do not 
communicate with each other. We stress that the nature of
our problems in network compression involving multiple 




Fig. 1. The basic scenario of two-source memory-assisted compression. 

sources is fundamentally different from those addressed by 
the Slepian-Wolf (SW) coding and multi-terminal source 
coding in IT], El, lEl, Id- Here, instead of symbol-by- 
symbol correlation between the sequences as in SW setup or 
the correlated Gaussian model among several observations 
of a phenomenon, the correlation is due to the source
parameters being a priori unknown [23], [24]. To clarify,
considering the example in Fig. 1 with sources $S_1$ and $S_2$,
$y^m$ and $x^n$ would be independent given that the source
models are known. However, when the source parameters
are unknown, $y^m$ and $x^n$ are correlated with each other
through the information they contain about the unknown
but correlated source parameters. The question that arises
in distributed network compression via memory is whether
or not this correlation can be leveraged by the
encoder at $S_2$ and the decoder at $M$ in the decoding of
$x^n$ using $y^m$ (from $S_1$) to reduce the codelength of $x^n$.

The rest of this paper is organized as follows. In Section II,
we present the problem setup and the related work. In
Section III, we briefly review the necessary background and
definitions. In Section IV, we present our main results on
the redundancy. In Section V, we provide discussion on the
results. Finally, Section VI concludes the paper.

II. Problem Setup and Related Work 

We present the memory-assisted network compression
problem in the most basic scenario, shown in Fig. 1, consisting
of two correlated sources located in nodes $S_1$ and
$S_2$, the intermediate relay node $M$, and two client nodes $C_1$
and $C_2$. Let $y^m$ and $x^n$ denote two sequences with lengths
$m$ and $n$ that are generated by $S_1$ and $S_2$, respectively.
We assume that $S_1$ has transmitted the sequence $y^m$ to
$C_1$ through the intermediate node $M$. We further assume
that $M$ is a memory unit, i.e., capable of memorizing the
sequence $y^m$. Next, at some later time, $S_2$ wishes to send
$x^n$ to $C_2$ through the intermediate node $M$. At this time,
$y^m$ is available to the decoder at $M$. Thus, the encoder at
$S_2$ can encode the sequence $x^n$ with the knowledge that $y^m$
is available to the decoder at $M$, potentially improving the
universal compression of $x^n$ on the path from $S_2$ to $M$. Such
a code is decoded by $M$ before being forwarded to the final
destination $C_2$. A trivial lower bound on the expected number
of bits necessary for transmitting $x^n$ on the $S_2$-$M$ path is
$H(X^n|Y^m)$. Our goal is to analyze the lower bound and
its achievability in various settings.

Slepian and Wolf demonstrated that if the data
streams from two sources $S_1$ and $S_2$ have symbol-by-symbol
correlation, the sequences can be compressed to their joint
entropy when decoded at $M$ [1]. The idea is based on compressing
the jointly typical sequences $(x^n, y^n)$. As the length
$n$ of the sequences increases to infinity, the decoding of the
sequence $x^n$ at $M$ can be performed using an almost lossless
code with an average length that asymptotically approaches
the conditional entropy, i.e., $H(X^n|Y^n)$, with asymptotically
zero error probability, i.e., $\lim_{n\to\infty} p_e(n) = 0$. On the other
hand, if the decoder at $M$ chooses not to utilize the side
information provided by the sequence $y^n$, or the coding is
required to be strictly lossless,¹ the encoder at $S_2$ would have to
encode the sequence $x^n$ irrespective of what has already been
communicated between $S_1$ and $M$, which would in turn result
in an average code length of $H(X^n)$. After the relatively recent
development of practical Slepian-Wolf (SW) coding schemes
by Pradhan and Ramchandran [2], SW coding has drawn a
great deal of attention as a promising compression technique 
in many applications such as sensor networks (cf. LiJ and the 
references therein) and distributed video coding IS). 

The Slepian-Wolf theorem naturally suits applications
where the (new) sequence $x^n$ from $S_2$ (in Fig. 1) can be
viewed as a noisy version of the (previously seen) sequence
$y^m$, such as data gathering from neighboring sensors that
measure the same phenomenon. However, in many other
scenarios, the compression of spatially separated sources
cannot be modeled by the SW framework. Examples include
the universal compression of data from multiple mirrors
of a data server and acquiring data chunks in a content-centric
network. In such applications, it is plausible to
assume that the sources ($S_1$ and $S_2$ in Fig. 1) follow a
correlated (sometimes even identical) statistical model that is
a priori unknown (to the encoder and the decoder), requiring
universal compression [23], [26], [27]. We assume that the
servers at $S_1$ and $S_2$ are stationary and ergodic parametric
information sources that are unknown to the coding scheme.
The following example clarifies this model of correlation.

As an example, assume that source $S_1$ is a server that
generates Bernoulli random variables (RVs) with unknown
source parameter $\theta$. Further, assume that source $S_2$ is a
mirror server in a different location with very similar content.
Thus, source $S_2$ is a Bernoulli RV generator with parameter
$\phi$, where we assume that $\phi$ follows a Gaussian distribution
around $\theta$. (If the mirror servers contain the exact same content,
we may even assume that $\phi = \theta$, i.e., the variance of $\phi$ can
be assumed to be equal to zero.) Let the sequences $y^m$ and
$x^n$ be generated independently by the two servers $S_1$ and
$S_2$, respectively. In this setup, the sequence $y^m$ is correlated
with $x^n$ through the information that they carry about the
unknown source parameters. For example, if most of the bits
in $y^m$ are 1's, it is very likely that most of the bits in $x^n$
are also 1's. The question is, assuming two sources $S_1$ and
$S_2$ with correlated unknown parameters and having $y^m$ from

¹Please see [25] for the formal definition of strictly lossless and almost
lossless codes. In short, strictly lossless coding is more restrictive than
almost lossless coding since it requires $p_e(n) = 0$ for all $n$, as opposed to
$\lim_{n\to\infty} p_e(n) = 0$.



$S_1$ memorized at the decoder at $M$, what is the achievable
universal compression performance on $x^n$ on the $S_2$-$M$ path, and
whether the correlation between $x^n$ and $y^m$ can be
leveraged by the encoder at $S_2$ and the decoder at $M$ in the
decoding of $x^n$ using $y^m$ to reduce the codelength of $x^n$.
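
The Bernoulli mirror-server example can be illustrated numerically. The following is a minimal simulation sketch, not part of the analysis in this paper: it assumes a uniform prior on $\theta$ (chosen only for illustration, not the worst-case prior), a hypothetical standard deviation `sigma` for the Gaussian perturbation, and clipping of $\phi$ to $(0, 1)$. It draws many independent pairs of sources and measures how strongly the fraction of 1's in $y^m$ predicts the fraction of 1's in $x^n$.

```python
import random

def correlated_parameters(rng, sigma):
    # theta: parameter of S_1, drawn uniformly (an illustrative prior only).
    theta = rng.uniform(0.05, 0.95)
    # phi: parameter of S_2, Gaussian around theta, clipped to stay in (0, 1).
    phi = min(max(rng.gauss(theta, sigma), 0.01), 0.99)
    return theta, phi

def fraction_of_ones(rng, p, length):
    # Empirical fraction of 1's in a Bernoulli(p) sequence of the given length.
    return sum(rng.random() < p for _ in range(length)) / length

def pearson(xs, ys):
    # Sample Pearson correlation coefficient of two equal-length lists.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(trials=500, m=200, n=200, sigma=0.05, seed=0):
    rng = random.Random(seed)
    ones_y, ones_x = [], []
    for _ in range(trials):
        theta, phi = correlated_parameters(rng, sigma)
        ones_y.append(fraction_of_ones(rng, theta, m))  # y^m from S_1
        ones_x.append(fraction_of_ones(rng, phi, n))    # x^n from S_2
    return pearson(ones_y, ones_x)

print(f"correlation of the fractions of 1's: {simulate():.3f}")
```

Given fixed parameters the two sequences are generated independently, so any correlation observed across trials comes only from the dependence between $\theta$ and $\phi$; increasing `sigma` weakens it.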

This problem can also be viewed as universal compression
[18]-[20] with training data that is only available to the
decoder. In [21], [22], we theoretically derived the gain that
is obtained in the universal compression of the new sequence
$x^n$ from $S_2$ by memorizing (i.e., having access to) $y^m$ from
$S_1$ at both the decoder (at $M$) and the encoder (at $S_2$). This
corresponds to the reduced case of our problem where the
sources $S_1$ and $S_2$ are either co-located (a single source)
or allowed to communicate. For the reduced problem,
in [14], [28], we further extended the setup to a network
with a single source and derived bounds on the network-wide
gain where a small fraction of the intermediate nodes
in the network are capable of memorization. However, the
extension to multiple spatially separated sources, where
the training data is only available to the decoder, is non-trivial
and raises a new set of challenges that we aim to address.

In [25], we extended network compression to distributed
sources in the special case where the
sources were identical. We derived an upper bound on the
achievable average minimax redundancy, where $S_1$ and $S_2$
share an identical parameter vector. In this paper, we let
the information sources at $S_1$ and $S_2$ be parametric with
$d$-dimensional parameter vectors $\theta$ and $\phi$, respectively. These
parameter vectors are unknown a priori to the encoder and
the decoder. Throughout the paper, we refer to this problem
setup as Distributed Network Compression with Correlated
Parameters (DNC-CP). We stress that the nature of DNC-CP
is fundamentally different from the problems addressed by the
Slepian-Wolf (SW) theorem in [1]. Here, instead of symbol-by-symbol
correlation between the sequences as in the SW setup,
we aim to remove the redundancy incurred by the universal
compression of finite-length sequences, whose dependency
is due to the correlation of their a priori unknown source
parameters [21], [23], [24]. Note that as the
length of the sequence $x^n$ grows to infinity, the redundancy
rate in the compression of $x^n$ vanishes since $\frac{1}{n}H(X^n)$
converges to the entropy rate as $n \to \infty$, and hence, the
potential benefits of DNC-CP vanish as the sequence length
grows, in contrast to the Slepian-Wolf framework, where
the benefits are studied in the asymptotic regime.

III. Notations and Definitions 

Thus far, we described the basic problem setup. In this
section, we provide further details involving notations and
definitions. Following the notation in [25], let $\mathcal{A}$ be a finite
alphabet. Let $d$ be the number of the source parameters. Let
$\Lambda^d$ denote the space of $d$-dimensional parameter vectors. Let
$\lambda \in \Lambda^d$ denote a $d$-dimensional parameter vector. Let $\mathcal{P}^d$
denote the family of sources that can be described with a
$d$-dimensional unknown parameter vector $\lambda$. We denote $\mu_\lambda$
as the probability measure that is defined by the parameter
vector $\lambda$ under the parametric source model. Let $\mathcal{I}(\lambda)$ denote
the Fisher information matrix for parameter vector $\lambda$.

We assume that the parameter vector $\theta \in \Lambda^d$ (corresponding
to source $S_1$) follows the worst-case prior in the
sense that it maximizes the expected redundancy (i.e., the
capacity-achieving prior in the maximin sense). This prior
distribution is particularly interesting because it corresponds
to the worst-case compression performance for the best
compression scheme. We further assume that given $\theta$, the
parameter vector $\phi \in \Lambda^d$ (i.e., the parameter vector of
source $S_2$) follows a Gaussian distribution with mean $\theta$
and covariance matrix $\Gamma(\theta)$. This models the nature of the
correlation of the sources $S_1$ and $S_2$ in our setup. Let $J(\theta)$
be a $d \times d$ matrix associated with the parameter vectors $\phi$ and
$\theta$, defined as $J(\theta) = \Gamma(\theta)\mathcal{I}(\theta)$. We assume that $J(\theta)$ is a
positive definite matrix. This assumption is necessary for the
conditional distribution to be well defined. Let $I_d$ be the $d \times d$
identity matrix. We use the notation $x^n = (x_1, \ldots, x_n) \in \mathcal{A}^n$
to represent a sequence of length $n$ from the alphabet $\mathcal{A}$
generated by $S_2$. We further denote $X^n$ as a random sequence
of length $n$ that follows the probability distribution $\mu_\phi$. Let
$H_n(\phi)$ be the entropy of the source $S_2$ given the parameter
vector $\phi$, i.e., $H_n(\phi) = H(X^n|\phi) = \mathbf{E}\log\frac{1}{\mu_\phi(X^n)}$.²

Let $c_n : \mathcal{A}^n \to \{0,1\}^*$ be an injective mapping from
the set $\mathcal{A}^n$ of the sequences of length $n$ over $\mathcal{A}$ to the set
$\{0,1\}^*$ of binary sequences. Further, let $l_n^{p_e}(x^n)$ denote the
almost lossless length function of the codeword associated
with the sequence $x^n$ with permissible error $p_e$. In the study
of coding strategies for DNC-CP, we compare the following
relevant cases for the compression of the sequence $x^n$ from
$S_2$ provided that the sequence $y^m$ from $S_1$ has already been
memorized by the node $M$ (in Fig. 1).

• UComp (Universal compression without memorization),
which only applies lossless universal compression to $x^n$
at $S_2$ without using the side information at $M$.

• DUCompMD (Distributed universal compression with
memory at decoder), which assumes that the decoder (at
$M$) has access to the memorized context sequence $y^m$ while
the encoder (at $S_2$) only knows $m$ but does not know the
exact sequence $y^m$. The encoder then applies a universal
code to $x^n$ that is decoded at $M$ by utilizing $y^m$.

• DUCompME (Distributed universal compression with
common memory at both the decoder and the encoder),
which assumes that the two encoders at $S_1$ and $S_2$ can
communicate, and thus, the decoder (at $M$) and the
encoder (at $S_2$) have access to a shared sequence $y^m$,
which is utilized in the compression of $x^n$ at $S_2$.

In this paper, we use the average minimax redundancy as
the performance metric for the different coding strategies.
Let $L_n^{p_e}$ denote the space of universal almost lossless length
functions on a sequence of length $n$, with permissible decoding
error $p_e$. Denote $R_n(l_n^{p_e}, \phi)$ as the expected redundancy
of the almost lossless code on a sequence of length $n$ for the
parameter vector $\phi$, i.e., $R_n(l_n^{p_e}, \phi) = \mathbf{E}\, l_n^{p_e}(X^n) - H_n(\phi)$.
Accordingly, the average minimax redundancy, which corresponds
to the performance of the best code over the worst
parameter vector, is defined as follows.

$$\bar{R}^{p_e}_{\mathrm{UComp}}(n) = \inf_{l_n^{p_e} \in L_n^{p_e}} \sup_{\phi \in \Lambda^d} R_n(l_n^{p_e}, \phi). \qquad (1)$$

²Throughout this paper, all expectations are taken with respect to the
probability measure $\mu_\phi$, and $\log(\cdot)$ denotes the logarithm in base 2.



We denote $\bar{R}^{0}_{\mathrm{UComp}}(n)$ as the average minimax redundancy
when the compression scheme is restricted to be strictly
lossless instead of almost lossless, i.e., $p_e = 0$.

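In DUCompMD, let $\hat{l}^{p_e}_{n,m,\Gamma} : \mathcal{A}^n \times \mathbb{N} \times \mathbb{R}^{d \times d} \to \mathbb{R}$
denote the lossless universal length function with
a memorized sequence of length $m$ that is only available to
the decoder, with permissible error probability $p_e$.
Note that in this case, the sequence $y^m$ is not known to
the encoder while the length $m$ is still available to the
encoder. Further,
denote $\hat{L}^{p_e}_{n,m,\Gamma}$ as the space of such lossless universal length
functions. Denote $R_n(\hat{l}^{p_e}_{n,m,\Gamma}, \theta)$ as the expected redundancy
of encoding a sequence $x^n$ of length $n$ using the length
function $\hat{l}^{p_e}_{n,m,\Gamma}$. Further, let $\bar{R}^{p_e}_{\mathrm{DUCompMD}}(n, m, \Gamma)$ denote the
expected minimax redundancy, i.e.,

$$\bar{R}^{p_e}_{\mathrm{DUCompMD}}(n, m, \Gamma) = \inf_{\hat{l}^{p_e}_{n,m,\Gamma} \in \hat{L}^{p_e}_{n,m,\Gamma}} \sup_{\theta \in \Lambda^d} R_n(\hat{l}^{p_e}_{n,m,\Gamma}, \theta). \qquad (2)$$

Likewise, let $l^{p_e}_{n,m,\Gamma} : \mathcal{A}^n \times \mathcal{A}^m \times \mathbb{R}^{d \times d} \to \mathbb{R}$ be the lossless
universal length function with a shared memory of length $m$,
permissible error probability $p_e$, and covariance matrix $\Gamma$.
Denote $L^{p_e}_{n,m,\Gamma}$ as the space of lossless universal length
functions on a sequence of length $n$ with a shared memory of
length $m$. Denote $R_n(l^{p_e}_{n,m,\Gamma}, \theta)$ as the expected redundancy
of encoding a sequence of length $n$ from the source using
the length function $l^{p_e}_{n,m,\Gamma}$. Let $\bar{R}^{p_e}_{\mathrm{DUCompME}}(n, m, \Gamma)$ denote
the expected minimax redundancy for the lossless universal
length function with a memory of length $m$ shared
between the encoder and the decoder, i.e.,

$$\bar{R}^{p_e}_{\mathrm{DUCompME}}(n, m, \Gamma) = \inf_{l^{p_e}_{n,m,\Gamma} \in L^{p_e}_{n,m,\Gamma}} \sup_{\theta \in \Lambda^d} R_n(l^{p_e}_{n,m,\Gamma}, \theta). \qquad (3)$$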

Again, when we set $p_e = 0$, we refer to the strictly lossless
case. The following is a trivial statement comparing the
performance of almost lossless coding versus strictly lossless
coding.

Fact 1 For all of the described coding strategies, the
strictly lossless redundancy is an upper bound on the
redundancy of the almost lossless coding for any $p_e$.

The following trivial inequalities demonstrate that the
redundancy decreases when side information is available
to the decoder. Moreover, if the side information is also
available to the encoder, the redundancy is further decreased.

Fact 2 Let $p_e \geq 0$. Then, we have

$$\bar{R}^{p_e}_{\mathrm{DUCompME}}(n, m, \Gamma) \leq \bar{R}^{p_e}_{\mathrm{DUCompMD}}(n, m, \Gamma) \leq \bar{R}^{p_e}_{\mathrm{UComp}}(n).$$



IV. Main Results 

In this section, we evaluate the performance of each of 
the different coding schemes introduced in the previous 
section for the DNC-CP problem using their corresponding 
average minimax redundancy for both almost lossless and 
strictly lossless codes. We treat the strictly lossless codes 
(i.e., $p_e = 0$) separately since they are of independent
interest. Some of the proofs are omitted due to lack of space.
All these results are valid for finite-length n (as long as n is 
large enough to satisfy the central limit theorem criteria). 

A. Strictly Lossless DNC-CP 

1) UComp: In this case, the side information sequence is
not utilized at the decoder for the compression of $x^n$, and
hence, the minimum number of bits required to represent
$X^n$ is $H(X^n) = H_n(\phi) + I(X^n; \phi)$. Thus, $\bar{R}^{0}_{\mathrm{UComp}}(n) = \sup_{p(\phi)} I(X^n; \phi)$,
and it is straightforward to show the
following [24], [29].

Theorem 1 The average minimax redundancy for the strictly
lossless UComp coding strategy is

$$\bar{R}^{0}_{\mathrm{UComp}}(n) = \frac{d}{2}\log\left(\frac{n}{2\pi e}\right) + \log\int_{\Lambda^d} |\mathcal{I}(\lambda)|^{\frac{1}{2}}\, d\lambda + O\left(\frac{1}{n}\right).$$
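
As a worked instance of the two leading terms in Theorem 1, consider a memoryless Bernoulli source ($d = 1$). The sketch below assumes the standard form $\frac{d}{2}\log\frac{n}{2\pi e} + \log\int|\mathcal{I}(\lambda)|^{1/2}d\lambda$ with the Fisher information expressed in natural units, so that $\mathcal{I}(\theta) = 1/(\theta(1-\theta))$ and the integral of its square root over $(0, 1)$ equals $\pi$. The function names are hypothetical, and the vanishing remainder term is ignored.

```python
import math

def ucomp_redundancy_bits(n, d, log2_fisher_integral):
    # Leading terms of the strictly lossless UComp redundancy, in bits:
    # (d/2) log2(n / (2 pi e)) + log2( integral of |I(lambda)|^(1/2) ).
    return 0.5 * d * math.log2(n / (2 * math.pi * math.e)) + log2_fisher_integral

def bernoulli_ucomp_redundancy_bits(n):
    # For Bernoulli(theta): I(theta) = 1 / (theta (1 - theta)) in natural units,
    # and the integral of sqrt(I(theta)) over (0, 1) equals pi.
    return ucomp_redundancy_bits(n, d=1, log2_fisher_integral=math.log2(math.pi))

for n in (100, 1000, 10000):
    print(n, round(bernoulli_ucomp_redundancy_bits(n), 3))
```

Under these assumptions the overhead grows by half a bit per parameter each time $n$ doubles, which is why it is significant relative to the entropy of short sequences and negligible for long ones.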



2) DUCompMD: Next, we confine ourselves to strictly
lossless codes in the DUCompMD strategy. In [25], we
established that the memorization of $y^m$ at the
decoder does not provide any benefit on the strictly lossless
universal compression of the sequence $x^n$ from $S_2$ when
the parameter vectors are identical. It is straightforward to
generalize that result as follows.

Theorem 2 The average minimax redundancy for the strictly
lossless DUCompMD coding strategy is

$$\bar{R}^{0}_{\mathrm{DUCompMD}}(n, m, \Gamma) = \bar{R}^{0}_{\mathrm{UComp}}(n).$$



3) DUCompME: Next, we present the main result on the
strictly lossless codes for the DUCompME coding strategy. In
this case, since the random sequence $Y^m$ is also known to
the encoder, the achievable codelength for representing $x^n$ is
given by $H(X^n|Y^m)$. Then, the redundancy is given by the
following theorem.

Theorem 3 The average minimax redundancy for the strictly
lossless DUCompME coding strategy is

$$\bar{R}^{0}_{\mathrm{DUCompME}}(n, m, \Gamma) = \hat{R}(n, m, \Gamma) + O\left(\frac{1}{n} + \frac{1}{m}\right),$$

where the main redundancy term is given by

$$\hat{R}(n, m, \Gamma) = \sup_{\lambda \in \Lambda^d} \frac{1}{2}\log\left|\left(1 + \frac{n}{m}\right) I_d + n J(\lambda)\right|. \qquad (4)$$
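
The main redundancy term can be evaluated numerically for a given matrix $J$. The sketch below is illustrative only: it assumes the term has the form $\frac{1}{2}\log\left|\left(1 + \frac{n}{m}\right)I_d + nJ\right|$, which is the form consistent with (6) and with Propositions 7 and 8, and it evaluates the expression for a single fixed $J$ rather than taking the supremum over $\lambda$. The helper names are hypothetical.

```python
import math

def determinant(matrix):
    # Determinant by Gaussian elimination with partial pivoting.
    a = [row[:] for row in matrix]
    size = len(a)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
    return det

def main_redundancy_bits(n, m, J):
    # 0.5 * log2 |(1 + n/m) I_d + n J| for one fixed d x d matrix J.
    # Pass m = math.inf to model unlimited memory (n / m is then 0.0).
    d = len(J)
    shifted = [[n * J[i][j] + ((1 + n / m) if i == j else 0.0)
                for j in range(d)] for i in range(d)]
    det = determinant(shifted)
    if det <= 0:
        raise ValueError("(1 + n/m) I_d + n J must have a positive determinant")
    return 0.5 * math.log2(det)

J = [[0.02, 0.01], [0.01, 0.02]]  # hypothetical J, for illustration only
for m in (100, 1000, math.inf):
    print(m, round(main_redundancy_bits(100, m, J), 3))
```

For a fixed $J$ the value decreases as $m$ grows but stays bounded away from zero, since the $nJ$ contribution does not depend on the memory size.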



B. Almost Lossless DNC-CP 

In this case, we investigate the reduction in the average
codelength associated with a sequence $x^n$ as a result of the
permissible error probability $p_e$.

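1) UComp: We demonstrate the following lower bound
on the redundancy.

Theorem 4 The average minimax redundancy for the almost
lossless UComp coding strategy is lower bounded by

$$\bar{R}^{p_e}_{\mathrm{UComp}}(n) \geq (1 - p_e)\,\bar{R}^{0}_{\mathrm{UComp}}(n) - h(p_e) - p_e H_n(\phi),$$

where $h(\cdot)$ denotes the binary entropy function.

Proof: Please refer to the Appendix for the proof. ■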
2) DUCompMD: In this case, we proved in [25] that the
permissible error probability $p_e$ potentially results in further
reduction in the average codelength. The generalization of
that result for sources with correlated parameters is given
by the following theorem.

Theorem 5 The average minimax redundancy for the almost
lossless DUCompMD coding strategy is upper bounded by

$$\bar{R}^{p_e}_{\mathrm{DUCompMD}}(n, m, \Gamma) \leq \hat{R}(n, m, \Gamma) + \mathcal{F}(d, p_e) + O\left(\frac{1}{n} + \frac{1}{m}\right),$$

where $\hat{R}(n, m, \Gamma)$ is the main redundancy term defined
in (4) and $\mathcal{F}(d, p_e)$ is the penalty due to the encoders not
communicating, given by

$$\mathcal{F}(d, p_e) = \frac{d}{2}\log\left(1 + \frac{2}{d \log e}\log\frac{1}{p_e}\right). \qquad (5)$$
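
To give a feel for the size of the penalty term, the sketch below evaluates it numerically, assuming it has the form $\frac{d}{2}\log\left(1 + \frac{2}{d\log e}\log\frac{1}{p_e}\right)$ with logarithms in base 2. The function name is hypothetical.

```python
import math

def noncommunication_penalty_bits(d, p_e):
    # (d/2) * log2(1 + (2 / (d * log2(e))) * log2(1 / p_e)), in bits.
    # Note that log2(1 / p_e) / log2(e) equals ln(1 / p_e).
    if not 0.0 < p_e < 1.0:
        raise ValueError("p_e must lie strictly between 0 and 1")
    inner = 1.0 + (2.0 / (d * math.log2(math.e))) * math.log2(1.0 / p_e)
    return 0.5 * d * math.log2(inner)

for p_e in (1e-1, 1e-3, 1e-6):
    print(p_e, round(noncommunication_penalty_bits(1, p_e), 3))
```

The penalty grows only doubly logarithmically in $1/p_e$, so under this form even very small permissible error probabilities cost a few bits per parameter, whereas the strictly lossless case ($p_e = 0$) is excluded because the expression diverges there.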



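3) DUCompME: We have the following lower bound.

Theorem 6 The average minimax redundancy for the almost
lossless DUCompME coding strategy is lower bounded by

$$\bar{R}^{p_e}_{\mathrm{DUCompME}}(n, m, \Gamma) \geq (1 - p_e)\,\bar{R}^{0}_{\mathrm{DUCompME}}(n, m, \Gamma) - h(p_e) - p_e H_n(\phi).$$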



V. Discussion on the Results

In this section, we provide some discussion on the significance
of the results for different DNC-CP coding strategies.
We discuss the strictly lossless case followed by two
examples that illustrate the impact of the source parameter
correlation on the results of the almost lossless and strictly
lossless schemes.

A. Strictly Lossless

In the case of UComp, Theorem 1 determines the achievable
average minimax redundancy for the compression of
a sequence of length $n$ encoded regardless of the previous
sequence $y^m$. In other words, UComp is an end-to-end
universal compression scheme that does not use memorization.
Hence, UComp is used as the benchmark for the
performance of DUCompMD and DUCompME, which are
memory-assisted network compression techniques.

According to Theorem 2, in DNC-CP, if strictly lossless
codes are to be used for the compression of $x^n$ from $S_2$,
the memorization of the previous sequences from $S_1$ by
the decoder does not provide any benefit, assuming that
the two encoders at $S_1$ and $S_2$ do not communicate (i.e.,
DUCompMD). In other words, the best that $S_2$ can do for
the strictly lossless compression of $x^n$ is to simply apply a
traditional universal compression.

Theorem 3 determines the main redundancy term in the
strictly lossless DUCompME coding strategy. It can be deduced
from Fact 2 that if the two encoders communicate
(i.e., DUCompME), the performance of strictly lossless compression
of $x^n$ would improve with respect to UComp. It is
straightforward to see that as $m$ grows, the main redundancy
term in (4) decreases. However, the main redundancy term
for very large memory (i.e., $m \to \infty$) is given by

$$\hat{R}(n, \infty, \Gamma) = \sup_{\lambda} \frac{1}{2}\log\left|I_d + n J(\lambda)\right|, \qquad (6)$$

which remains non-zero in general. Therefore, increasing $m$
beyond a certain limit does not provide further performance
improvement. In summary, for the strictly lossless case,
only DUCompME is interesting as it offers a benefit over
UComp, but it is not practical as it requires the encoders
to communicate.

B. Example 1: Identical Source Parameters

In this special case, we assume that the source parameters
$\theta$ and $\phi$ are identical, and hence, $J(\theta) = \Gamma(\theta) = 0_d$.
The performance of the strictly lossless DUCompME coding
strategy and the almost lossless DUCompMD coding strategy
is quantified by $\hat{R}(n, m, 0_d)$, which is given in the following
proposition, recovering what was proved in [25].

Proposition 7 The main redundancy term of (4) for
identical source parameters is given by

$$\hat{R}(n, m, 0_d) = \frac{d}{2}\log\left(1 + \frac{n}{m}\right).$$

We further consider the redundancy for large $m$. It can be
shown that $\lim_{m \to \infty} \bar{R}^{0}_{\mathrm{DUCompME}}(n, m, 0_d) = 0$. In
other words, since the parameter vector will be known to both
the encoder and the decoder, the code's redundancy vanishes
similar to the Shannon code.³ In this case, the fundamental
limits are those of known source parameters and universality
no longer imposes a compression overhead.

³Note that we have ignored the integer constraint on the length functions
in this paper, which will result in a negligible $O(1)$ redundancy that is
exactly analyzed in [30], [31].

C. Example 2: Correlation Covariance Matrix Inversely Proportional to Fisher Information Matrix

Next, we consider the case where the covariance matrix
$\Gamma(\theta)$ is inversely proportional to the Fisher information
matrix, i.e., $\Gamma^{-1}(\theta) = \alpha\,\mathcal{I}(\theta)$. In this case, the two parameter
vectors can be viewed as estimates of each other.

Proposition 8 The main redundancy term of (4) for the case
where $\Gamma^{-1}(\theta) = \alpha\,\mathcal{I}(\theta)$ is given by

$$\hat{R}\left(n, m, \frac{1}{\alpha}\mathcal{I}^{-1}\right) = \frac{d}{2}\log\left(1 + \frac{n}{m} + \frac{n}{\alpha}\right).$$

Hence, as the correlation between the two parameters increases
(i.e., as $\alpha$ grows), the redundancy decreases and eventually converges
to that of the identical source parameters.
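
The trend described above can be checked with a short sweep over $\alpha$. The sketch below assumes the closed forms $\frac{d}{2}\log\left(1 + \frac{n}{m} + \frac{n}{\alpha}\right)$ for this example and $\frac{d}{2}\log\left(1 + \frac{n}{m}\right)$ for identical parameters; the function names and the numerical values of $n$, $m$, and $d$ are chosen only for illustration.

```python
import math

def example2_redundancy_bits(n, m, alpha, d):
    # Main redundancy term when the covariance is (1/alpha) times the
    # inverse Fisher information: (d/2) log2(1 + n/m + n/alpha).
    return 0.5 * d * math.log2(1 + n / m + n / alpha)

def identical_parameters_redundancy_bits(n, m, d):
    # Main redundancy term for identical parameters: (d/2) log2(1 + n/m).
    return 0.5 * d * math.log2(1 + n / m)

n, m, d = 100, 400, 2
for alpha in (10, 100, 1000, 10000):
    print(alpha, round(example2_redundancy_bits(n, m, alpha, d), 3))
print("identical:", round(identical_parameters_redundancy_bits(n, m, d), 3))
```

The same closed form also shows the saturation in the memory size: for a finite $\alpha$, letting $m$ grow without bound leaves $\frac{d}{2}\log\left(1 + \frac{n}{\alpha}\right)$, so additional memory cannot compensate for weakly correlated parameters.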

VI. Conclusion 

In this paper, we introduced and studied the problem of
Distributed Network Compression with Correlated
Parameters (DNC-CP). In DNC-CP, the correlation of
the two source parameters becomes relevant due to the finite- 
length universal compression constraint. This model departs 
from the nature of the correlation in the SW framework. 
For DNC-CP, involving two correlated sources, we inves- 
tigated the average minimax redundancy. We demonstrated 
that memorization at the intermediate nodes in the network 
can help to noticeably improve the performance of the 
universal compression on multiple sources whose parameters 
are correlated. On the other hand, we did not provide a coding 
strategy that achieves the performance limits derived in this 
paper. 
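Appendix

Proof of Theorem 4

In order to prove this theorem, we consider the joint entropy
$H(X^n, 1_e(X^n), \hat{X}^n)$, where $\hat{X}^n$ denotes the decoded sequence
and $1_e(X^n)$ is the indicator of a decoding error.
Note that both $\hat{X}^n$ and $1_e(X^n)$
are deterministic functions of $X^n$ and hence

$$H(X^n, 1_e(X^n), \hat{X}^n) = H(X^n). \qquad (7)$$

On the other hand, we can also use the chain rule in a
different order to arrive at the following.

$$H(X^n, 1_e(X^n), \hat{X}^n) = H(\hat{X}^n) + H(1_e(X^n)|\hat{X}^n) + H(X^n|1_e(X^n), \hat{X}^n). \qquad (8)$$

Hence,

$$H(\hat{X}^n) = H(X^n) - H(1_e(X^n)|\hat{X}^n) - H(X^n|1_e(X^n), \hat{X}^n)$$
$$\geq H(X^n) - h(p_e) - H(X^n|1_e(X^n), \hat{X}^n) \qquad (9)$$
$$\geq H(X^n) - h(p_e) - p_e H(X^n), \qquad (10)$$

where the inequality in (9) is due to the fact that
$H(1_e(X^n)|\hat{X}^n) \leq H(1_e(X^n)) = h(p_e)$ and the inequality
in (10) is due to Lemma 1.

Lemma 1 $H(X^n|1_e(X^n), \hat{X}^n) \leq p_e H(X^n)$.

Proof:

$$H(X^n|1_e(X^n), \hat{X}^n) = (1 - p_e) H(X^n|1_e(X^n) = 0, \hat{X}^n) + p_e H(X^n|1_e(X^n) = 1, \hat{X}^n) \qquad (11)$$
$$\leq p_e H(X^n). \qquad (12)$$

The first term in (11) is zero since if $1_e(X^n) = 0$, we
have $X^n = \hat{X}^n$ and hence $H(X^n|1_e(X^n) = 0, \hat{X}^n) = 0$.
The inequality in (12) then follows from the fact that
$H(X^n|1_e(X^n) = 1, \hat{X}^n) \leq H(X^n)$, completing the proof. ■

The proof of the theorem is completed by noting that
$H(X^n) = H_n(\phi) + \bar{R}^{0}_{\mathrm{UComp}}(n)$.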


